
ccl/sqlproxyccl: complete connection migration support in the forwarder #76805

Merged: 3 commits merged into cockroachdb:master from jaylim-crl:jay/connection-migration on Mar 13, 2022

Conversation

@jaylim-crl (Collaborator) commented Feb 19, 2022

ccl/sqlproxyccl: add last message details to the forwarder's processor

This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.

At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.
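For illustration, a minimal sketch of such a logical clock, assuming a shared atomic counter (makeLogicalClockFn is the name used later in this review; the body is illustrative):

package sqlproxyccl

import "sync/atomic"

// makeLogicalClockFn returns a function that hands out a new logical
// "timestamp" on every call. Later calls always observe strictly larger
// values, which is enough to order last-message events across processors.
func makeLogicalClockFn() func() uint64 {
	var counter uint64
	return func() uint64 {
		return atomic.AddUint64(&counter, 1)
	}
}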

Release justification: Low risk, sqlproxy-only change.

Release note: None

ccl/sqlproxyccl: support waitResumed on the processor to block until resumption

Previously, there could be a race where suspend() was called right after
resuming the processors. If the processor goroutines had not yet started,
suspend() would return immediately, violating the invariant that the
processors must be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have resumed.
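A one-shot sketch of what waitResumed could look like, assuming the run loop closes a channel once it has started (the resumedC field is an assumption, and a real processor that suspends and resumes repeatedly would need to recreate it):

package sqlproxyccl

import "context"

type processor struct {
	resumedC chan struct{} // closed by the run loop once it has started
}

// waitResumed blocks until the processor's run loop is running, or until
// the context is done, whichever comes first.
func (p *processor) waitResumed(ctx context.Context) error {
	select {
	case <-p.resumedC:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}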

Release justification: sqlproxy-only change.

Release note: None

ccl/sqlproxyccl: complete connection migration support in the forwarder

Informs #76000.

This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:

  • proxy.conn_migration.success
  • proxy.conn_migration.error_fatal
  • proxy.conn_migration.error_recoverable
  • proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.
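As a rough sketch of how one transfer outcome could feed these counters (the struct and field names below are assumptions, not the actual metrics.go definitions):

package sqlproxyccl

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// metrics is a hypothetical subset of the proxy metrics; field names are
// guesses derived from the metric names above.
type metrics struct {
	ConnMigrationAttemptedCount        *metric.Counter
	ConnMigrationSuccessCount          *metric.Counter
	ConnMigrationErrorRecoverableCount *metric.Counter
	ConnMigrationErrorFatalCount       *metric.Counter
}

// recordTransferOutcome bumps attempted, then exactly one of the outcome
// counters. isConnRecoverableError is the sentinel check sketched later
// in this thread.
func (m *metrics) recordTransferOutcome(err error) {
	m.ConnMigrationAttemptedCount.Inc(1)
	switch {
	case err == nil:
		m.ConnMigrationSuccessCount.Inc(1)
	case isConnRecoverableError(err):
		m.ConnMigrationErrorRecoverableCount.Inc(1)
	default:
		m.ConnMigrationErrorFatalCount.Inc(1)
	}
}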

Release justification: This completes the first half of the connection
migration feature. This is low risk, as part of the code is guarded behind
the connection migration feature, which is not currently used in production.
In addition, CockroachCloud is the only user of sqlproxy.

Release note: None

@cockroach-teamcity (Member) commented

This change is Reviewable

@jaylim-crl (Collaborator, Author) left a comment

The structure for the connection migration work is there. Note that this is untested, and was mostly inspired by the initial prototype. Putting this up in case anyone has thoughts on the direction. I'll be adding tests next.

When it comes to the forwarder's API, I plan to have a Forwarder interface that implements just two methods: RequestTransfer and Close. With the current design, callers wouldn't know whether the transfer has failed or succeeded after calling RequestTransfer. I think that is fine for now, and if we really need to know the last transfer error, we could address that in the upcoming RFC. That said, the initial thought is to have the forwarder notify a "forwarder tracking" component (i.e. map of pods to forwarders) if it connects/disconnects from serverConn. The proxy can then check which pod the forwarder is assigned to (versus the previous pod) to know whether the transfer has succeeded.
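Sketched as code, the planned surface would be roughly the following (names follow the comment above, not necessarily the final API):

// Forwarder is the narrow interface the proxy would hold. Transfers are
// fire-and-forget: callers cannot observe whether one succeeded.
type Forwarder interface {
	// RequestTransfer requests that the connection be transferred to a
	// different SQL pod; it returns without waiting for the outcome.
	RequestTransfer()
	// Close terminates the forwarder and closes both connections.
	Close()
}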

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 52 at r1 (raw file):

)

var (

These error messages might be unnecessary, at least for the ones that are only used once. We could just return errors.New there if we wanted to.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 009f2d6 to e7730d1 on February 24, 2022 21:19
@jaylim-crl (Collaborator, Author) left a comment

TFTRs!

This draft PR is ready for another look. I added more tests, and changed how we handle timeout handlers. I also tested this, and it works as expected.

This test is still WIP, and is not included in this PR. Will update that later today. Here, we call RequestTransfer in a loop, and constantly query the SQL pod details. We can see that we alternate between pods, which implies that connection migration works:

=== RUN   TestConnectionMigration/successful_connection
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}

There are three more items to address:

  1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
  2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
  3. Add TestConnectionMigration (which is an end-to-end test) to this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 269 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Can we do this without atomics? I do not like using atomics for control flow. Getting atomic programming right is infamously difficult. In this case there is a race condition that occurs if RequestTransfer is called concurrently and it's not clear to me how to fix it. At a minimum CompareAndSwap needs to be used as the guard for this state transfer. But then there is a race with setting transferKey and transferCh.

Done - good catch re: concurrent calls. As discussed, I have moved this to use locks. My benchmarks show that using locks within a 30s workload run only slows the overall time by a few hundred ms, so I'd be fine with using locks everywhere instead of a mixture of both.
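For context, a lock-based guard along these lines sidesteps the CAS races described above, since the isTransferring check and any companion state (transferKey, transferCh) can be updated in one critical section. The field names below are assumptions:

package sqlproxyccl

import "sync"

type forwarder struct {
	mu struct {
		sync.Mutex
		isTransferring bool
	}
}

// tryStartTransfer flips the forwarder into the transferring state under
// the lock, returning false if a transfer is already in progress. Any
// companion state can be set in the same critical section, which is what
// the atomics-based version could not do safely.
func (f *forwarder) tryStartTransfer() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.mu.isTransferring {
		return false
	}
	f.mu.isTransferring = true
	return true
}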


pkg/ccl/sqlproxyccl/forwarder.go, line 440 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

idea: Instead of implementing the forwarding goroutines as a state machine, initiating the transfer could ask both goroutines to terminate. That would allow us to implement transfer the same way authenticate is implemented: as a single function and goroutine. Once the transfer is canceled or completed, the goroutine managing the transfer could restart the forwarding goroutines.

IMO the main advantage to implementing it by stopping the forwarding goroutines is we can keep the steady state code simple and we can group all of the transfer logic together.

As discussed, this would increase the overall transfer latency, since we'd have to wait for the server-to-client processor to stop before sending the transfer state message, so I kept the current approach. I like the idea since it makes things cleaner, but the tradeoff wouldn't be worth it.


pkg/ccl/sqlproxyccl/forwarder.go, line 514 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

This logic can be simplified by context.WithTimeout.

transferCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
err := transfer(transferCtx)
if transferCtx.Err() != nil && ctx.Err() == nil {
	return errTransferTimeout
}
return err

Alternatively, if we do not specifically care about returning errTransferTimeout, the error from transfer can be returned directly.

Hm, this probably isn't relevant anymore since I moved it elsewhere.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from e7730d1 to 6a8d610 on February 25, 2022 05:32
@jaylim-crl jaylim-crl changed the title [DNM] ccl/sqlproxyccl: add connection migration support to the forwarder ccl/sqlproxyccl: add connection migration support to the forwarder Feb 25, 2022
@jaylim-crl jaylim-crl changed the title ccl/sqlproxyccl: add connection migration support to the forwarder ccl/sqlproxyccl: add partial connection migration support to the forwarder Feb 25, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 6a8d610 to 87b8ea5 on February 25, 2022 05:33
@jaylim-crl jaylim-crl marked this pull request as ready for review February 25, 2022 05:35
@jaylim-crl jaylim-crl requested a review from a team as a code owner February 25, 2022 05:35
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 87b8ea5 to e3bb0d3 on February 25, 2022 06:21
@jaylim-crl (Collaborator, Author) left a comment

There are three more items to address:

1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
3. Add TestConnectionMigration (which is an end-to-end test) to this PR.

(2) and (3) are done. I've decided to just leave them in this PR, but happy to break them down into separate PRs if reviewers prefer otherwise. (1) will definitely come in a later PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 629 at r3 (raw file):

// NOTE: This should only be called during a transfer process. We assume that
// transferCloserCh has already been initialized.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {

The end-to-end transfer process works, but I'm not happy with this function. The unfortunate thing is that we unblock clients through context, but for servers, we'd do that by closing the forwarder. This goroutine is needed to ensure that we can terminate if we time out when sending the SHOW TRANSFER STATE query to the server. I may also have gone down a rabbit hole with all the transferConnRecoverable stuff, but I don't see a better way to address that. Happy to discuss.

closer is needed instead of relying on transferCtx.Done() to ensure that we can distinguish between a transfer timeout and an actual transfer completion. If we block on transferCtx.Done() in the client-to-server processor, then when the timeout handler is triggered, we will unblock automatically, which is incorrect.
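Roughly, the handler described above behaves like this sketch (transferCloserCh and Close are assumptions about the surrounding code):

package sqlproxyccl

import "time"

type forwarder struct {
	transferCloserCh chan struct{} // closed once the transfer finishes
}

// Close terminates the forwarder; closing the server connection unblocks
// any read pending on it.
func (f *forwarder) Close() {}

// runTransferTimeoutHandler force-closes the forwarder if the transfer
// does not finish within waitTimeout. Because transferCloserCh is closed
// on completion (success or failure), a finished transfer is never
// mistaken for a timeout.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {
	go func() {
		timer := time.NewTimer(waitTimeout)
		defer timer.Stop()
		select {
		case <-f.transferCloserCh:
			// Transfer finished in time; nothing to do.
		case <-timer.C:
			f.Close()
		}
	}()
}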

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 27, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections that
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
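With the cockroachdb/errors package, the marker pattern described here might look like this (the sentinel name comes from the commit message; the helper names are illustrative):

package sqlproxyccl

import "github.com/cockroachdb/errors"

// errConnRecoverableSentinel marks errors after which the connection is
// still usable, e.g. a serialization error surfaced by SHOW TRANSFER STATE.
var errConnRecoverableSentinel = errors.New("connection recoverable")

// markAsConnRecoverableError tags err with the recoverable sentinel.
func markAsConnRecoverableError(err error) error {
	return errors.Mark(err, errConnRecoverableSentinel)
}

// isConnRecoverableError reports whether err carries the sentinel.
func isConnRecoverableError(err error) bool {
	return errors.Is(err, errConnRecoverableSentinel)
}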
@jaylim-crl (Collaborator, Author) left a comment

FYI, I extracted some of the work here in #77109. I'll do the same for the auth stuff.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections that
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

Previously, we assumed that with token-based authentication, the server is
ready to accept queries the moment we connect to it. This assumption was
proved wrong during the integration tests with the forwarder: there is an
implicit AuthenticateOK step that happens after connecting to the server.
During that time, initial connection data such as ParameterStatus and
BackendKeyData messages will be sent to the client as well. For now, we will
ignore those messages. Once we start implementing query cancellation within
the proxy, that has to be updated to cache the new BackendKeyData.

Release note: None
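A sketch of the message-slurping step, assuming pgproto3-style message types (the exact API shape here is an assumption):

package sqlproxyccl

import "github.com/jackc/pgproto3/v2"

// slurpInitialMessages drains the server's initial connection data until
// ReadyForQuery, ignoring ParameterStatus and BackendKeyData for now.
// Once query cancellation is implemented, BackendKeyData must be cached
// rather than dropped.
func slurpInitialMessages(serverConn *pgproto3.Frontend) error {
	for {
		msg, err := serverConn.Receive()
		if err != nil {
			return err
		}
		switch msg.(type) {
		case *pgproto3.ParameterStatus, *pgproto3.BackendKeyData:
			// Ignore.
		case *pgproto3.ReadyForQuery:
			return nil
		}
	}
}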
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections that
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from e3bb0d3 to 4357e45 on February 28, 2022 06:17
@jaylim-crl jaylim-crl changed the title ccl/sqlproxyccl: add partial connection migration support to the forwarder ccl/sqlproxyccl: complete connection migration support in the forwarder Feb 28, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch 2 times, most recently from a095f21 to cd6e8dd on February 28, 2022 12:47
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 2fdda2d to fddfd30 on March 10, 2022 19:29
@jaylim-crl jaylim-crl requested a review from a team as a code owner March 10, 2022 19:29
@jaylim-crl jaylim-crl removed the request for review from a team March 10, 2022 19:30
@jaylim-crl (Collaborator, Author) left a comment

OK - this is ready for another round of review. Some tests are TODO, which will be addressed before merging, but I'm putting this up for a review first. Only look at the last three commits.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch 2 times, most recently from 97f79e3 to 9661512 on March 10, 2022 21:02
@jeffswenson (Collaborator) left a comment

This is looking great! I have a few minor comments.

pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22:
}
}()

f.mu.Lock()
@jeffswenson (Collaborator) commented:

This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.

The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.

I'm thinking something like:

func (f *forwarder) getProcessors() (request processor, response processor) {
 	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
    return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)
	
return nil

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 9661512 to d6a7488 on March 11, 2022 16:50
@jaylim-crl jaylim-crl requested a review from jeffswenson March 11, 2022 16:51
@jaylim-crl (Collaborator, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I don't think logCtx is necessary. I looked at the implementation of log and I did not find any calls that check ctx.Err(). If it did, I would argue that is a bug. It is normal to log something that goes wrong during the cancellation process.

I didn't have this initially, but the process panics with a span-use-after-close error:

panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>

goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl	2.852s

pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.

The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.

I'm thinking something like:

func (f *forwarder) getProcessors() (request processor, response processor) {
 	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
    return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)
	
return nil

Done. Updated it per your suggestion, but also introduced getConns for the client and server connections.


pkg/ccl/sqlproxyccl/forwarder.go, line 190 at r18 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

😄 On a 100gbps ethernet line, sending the smallest possible pg message, with no protocol overhead, it would take 217 years of continuous transmission to overflow a 64-bit counter.

>>> line_rate = 100 * 2 ** 30 # 100 Gib/s, in bits per second
>>> message_size = 5 * 8 # minimum pg message size of 5 bytes, in bits
>>> msg_per_second = line_rate / message_size
>>> seconds_to_overflow = 2 ** 64 / msg_per_second
>>> years_to_overflow = seconds_to_overflow / (60 * 60 * 24 * 365)
>>> years_to_overflow
217.908031253171

😄


pkg/ccl/sqlproxyccl/forwarder.go, line 320 at r18 (raw file):

It would require passing peekErr and typ as arguments.

Doing this muddies the logic in exitPeek, and I'd rather keep it this way with the extra mutex lock+unlock. If this proves to be an issue, we could easily optimize it later. Doing it the current way also ensures that updating the last message details is tied to the ForwardMsg call, and not to exiting.


pkg/ccl/sqlproxyccl/forwarder.go, line 83 at r20 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Generally I would advise against acquiring multiple locks at the same time because it is a deadlock risk.

Done. Added a note about that. I think as long as we're aware of it, such risks can be minimized. Also, using a single lock for all three structs is going to cause contention between the request and response processors, especially when queries are pipelined.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from d6a7488 to 64166a9 on March 11, 2022 18:09
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 64166a9 to 8610953 on March 11, 2022 18:28
@jeffswenson (Collaborator) left a comment

LGTM!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I didn't have this initially, but the process panics with a span-use-after-close error (see the stack trace quoted above).

Jay and I dug into this offline. The root cause is that sqlproxy is using the stopper incorrectly. This will be fixed in a separate PR: #77689.


pkg/ccl/sqlproxyccl/forwarder.go, line 158 at r16 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

@jeffswenson, how strongly do you feel about guarding these variables with a mutex when they don't need one? I attempted to do so, but it was extremely difficult to read/follow, so I abandoned the idea.

This was resolved by the redesign.

@jaylim-crl (Collaborator, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

The root cause is that sqlproxy is using the stopper incorrectly.

I guess it's more that we're not using the stopper consistently. Some places use stopper.RunAsyncTask, whereas others use the go keyword directly. The latter is handy, as we don't need to thread the stopper all the way down. Fixing all uses of the go keyword isn't straightforward, and we'd also have to rethink how we want traces to be displayed.
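For contrast, the two patterns being discussed, as a sketch (runTransfer stands in for whatever the goroutine does):

package sqlproxyccl

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

func spawnTransfer(ctx context.Context, stopper *stop.Stopper, runTransfer func(context.Context)) {
	// Tracked: the stopper waits for this task during shutdown, so the
	// task cannot outlive the resources (e.g. tracing spans) it uses.
	if err := stopper.RunAsyncTask(ctx, "conn-migration", runTransfer); err != nil {
		return // stopper is draining; do not start the transfer
	}

	// Untracked alternative: convenient, but the goroutine can outlive
	// its parent's tracing span, which is what produced the panic above.
	// go runTransfer(ctx)
}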

@jaylim-crl (Collaborator, Author) commented

bors r=JeffSwenson

@craig (Contributor) craig bot commented Mar 13, 2022

Build succeeded:

@craig craig bot merged commit dc5432d into cockroachdb:master Mar 13, 2022
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 13, 2022
Previously, we added support for connection migration in cockroachdb#76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial, as it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 13, 2022
ccl/sqlproxyccl: add metric that records the transfer response message size

Informs cockroachdb#76000, and a follow-up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 14, 2022
Previously, we added support for connection migration in cockroachdb#76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial, as it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 14, 2022
ccl/sqlproxyccl: add metric that records the transfer response message size

Informs cockroachdb#76000, and a follow-up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 14, 2022
77442: sqlproxyccl: Add codeUnavailable to the list of error codes r=alyshanjahani-crl a=alyshanjahani-crl

Release justification: fixes for high-priority bug in existing functionality

Previously, if a tenant cluster had maxPods set to 0, the error returned by
directory.EnsureTenantAddr was not treated as a non-retryable error.

The tenant directory implementation used in CockroachCloud now identifies
this situation and returns a status error with codes.FailedPrecondition and
a descriptive message.

This patch adds a check for the FailedPrecondition error in
connector.lookupAddr.

Release note: None

77626: sql: re-enable txnidcache by default r=maryliag,ajwerner a=Azhng

Previously, TxnID cache was disabled by default due to performance
issues. This has since been resolved in #77220.
This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None

77675: cmd/roachtest: tweaks to sstable corruption test r=itsbilal a=nicktrav

A few minor changes to the existing roachtest:

Reduce the warehouse count on ingest back to 100 - this should be ample,
and it matches the warehouse count on the read path in a later part of
the test.

Change the stop options to wait until the process has been terminated
before continuing.

Re-work the command to search for SSTables to corrupt - use tables from
the most recent manifest (if multiple are present); consider SSTables
with either a start or end key representing a user table; shuffle tables
to distribute corruption over the LSM; perform filtering in the bash
command rather than in the test runner itself.

Release justification: Roachtest-only change.

Release note: None.

Touches #77321.

77700: ccl/sqlproxyccl: add connection migration-related metrics r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add metric that measures connection migration latency

Previously, we added support for connection migration in #76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial, as it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

#### ccl/sqlproxyccl: add metric that records the transfer response message size

Informs #76000, and a follow-up to #76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

Co-authored-by: Alyshan Jahani <[email protected]>
Co-authored-by: Azhng <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Jay <[email protected]>